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In a multicellular organism different cell types express a gene in different amounts. Samples from 
which gene expression levels can be measured typically contain a mixture of different cell types, 
the resulting measurements thus give only averages over the different cell types present. Based 
on fluctuations in the mixture proportions from sample to sample it is in principle possible to 
reconstruct the underlying expression levels of each cell type: to deconvolute the sample. We use 
a statistical mechanics approach to the problem of deconvoluting such partial concentrations from 
mixed samples, give analytical results for when and how well samples can be unmixed, and suggest 
an algorithm for sample deconvolution. 



I. INTRODUCTION 

Organs in higher organisms are complex tissues containing a variety of different cell types. Brain tissue, for instance, 
contains not only neurons, but also supporting cells as the astrocytes and oligodendrocytes. Kidney tissue contains 
the filtering units (podocytes) as well as cells of the capillary system (tubules). Whereas two cells of different cell 
type have largely the same DNA sequence, only a cell-type specific set of genes will be expressed in a cell [ 1 , 2]. 

Over the last two decades, experimental methods have been developed which allow to measure the amount of 
m(essenger)-RNA from different genes in a sample [ii, 4]. However, one may not be interested in expression levels 
averaged over all cell types in such a sample, but may want to know the mRNA levels present in the different cell 
types. A particularly pressing example arises in cancer research, where tissue samples typically contain solid tumour 
and healthy tissue in unknown proportions [5]. 

Denoting the proportion of cell type a in sample /i by and the concentration of mRNA from gene i in cell type 
a by the concentration of mRNA from gene i in sample X'^ is given by 

n 

xr = 5]p^,"+^r , (1) 

a=l 

where the so-called residuals stem from sample-specific fluctuations of concentrations or random experimental 
errors. The number of different cell types is denoted by n. Additionally, we have the constraints < , < < 1 
andEaP^^-lV^i. 

Sample deconvolution [ ] is the inverse problem of reconstructing the concentrations of gene products in each cell 
type, as well as the mixing proportions of the samples from measurements of mixed samples . The information 
necessary for this reconstruction must come from fluctuations in the mixing proportions across samples: A cell type 
with high concentrations of mRNA of genes i and j will induce positively correlated fluctuations of the measurements 
and X'^ as the fraction of this cell type varies from sample to sample (see fig. 1). 

Equation (1) also arises in a broad range of contexts outside molecular biology. It is the fundamental equation of 
factor analysis^ a statistical approach where different unknown factors Xa contribute with linear weights Pa to some 
outcome X [7] . Applications arise for example in the context of face recognition [S] , data analysis in ecology [9] , or 
fiuorescence microscopy [IIJ, 11]. 

There are two questions we address in this paper. First, under what conditions is such learning from fluctuations 
possible at all? And second, how accurately can the reconstruction be made? We first discuss a general constraint 
on the minimal number of samples needed from linear algebra. We then formulate a Bayesian model for sample 
deconvolution and study this model from the point of view of statistical physics. We derive analytical results for 
the accuracy of reconstruction for a simple, non-trivial case of the problem. This model also suggests a simple 
general algorithm based on Markov chain Monte Carlo (MCMC) sampling. Comparison with an established class 
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FIG. 1: The sample deconvolution problem, a) A tissue sample containing a mixture of different cell types is taken. The 
concentration of a particular gene product in the sample is a linear combination of the concentrations in each cell type, b) Two 
different tissue samples generally contain different mixtures of cell types, while the concentration of gene products is largely 
constant across cells of a given type, c) Part of the variation of expression levels across samples is due to fluctuations in the 
mixing proportions. This provides the basis for reconstructing concentrations in each cell type and mixing proportions. 



of algorithms [12-14], which formulate sample deconvolution as an optimisation problem, demonstrates a generic 
drawback of such optimization approaches. 

As an initial remark, we derive a constraint on the minimal number of samples needed for reconstruction from 
linear algebra. Looking at the number of variables in eq. (1) we have, neglecting the residuals, the matrix equation 
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Denoting the total number of samples by M, the number of genes by N and the number of cell types by n, there are 
MN measurements on the left and M(n—l)+nN unknown variables on the right-hand side. If the number of unknown 
variables exceeds the number of data points, the system of equations is underdetermined. For a measurement of gene 
expression levels, the number N of genes will typically be of the order of hundreds or even thousands, exceeding by 
far the number n of cell types in a sample, or the number M of samples. For N ^ (n, M) the condition not to have 
an underdetermined set of equations reduces to M > n, so the number of samples taken has to be at least larger than 
the number of cell types. 



II. BAYESIAN MODEL FOR SAMPLE DECONVOLUTION 



For concreteness we assume that the residuals in (1) are independent and identically distributed Gaussian 
variables. Other distributions will be discussed below. These residuals stem from fluctuations in the concentrations of 
the same cell type ('biological noise') and random experimental error ('technical noise'). Given the mixing proportions 
and the concentrations in cell types a;°, this distribution of residuals induces the distribution of the measurements 

xt 

M N 

P(X|p,x) = (27ra|)-^ ^-^(P'-^^) , (3) 

with the Hamiltonian 

?^(p,x;X)^^ ^^-''£f°^'^' . (4) 
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Bold symbols are used to denote the corresponding matrices. The quantity of interest, namely the probability of a 
given set of mixtures and of concentrations in cell types given the measurements, follows from Bayes theorem; the 
so-called posterior probability P (p, expressed in terms of the likelihood (3), the prior P (p, x) 

and the marginal likelihood P (X) (which for a given set of measurements is just a multiplicative constant). 

For the prior P (p, x) a particular choice has to be made. To keep the analytical calculations tractable, we choose 
again independent Gaussian distributions with means Xa/pa and variances cr'^a/'^paj where we allow for different 
distribution parameters for each cell type. Here, we assume that the distribution parameters are chosen such that 
the contributions of values xf < and < or > 1 are negligibly small. In the general algorithm we discuss 
below, any distribution can be implemented. The constraint that mixing proportions for each sample add up to one 
is implemented using a delta function. For this particular choice of the prior we get 
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with the Hamiltonian 
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The first term in eq. (6) comes from the likelihood, penalizing the deviation of a possible solution {Pai^f} from the 
measurement {^f }• The second and third term are the Gaussian priors for the mixing proportions and the expression 
patterns, respectively. The posterior distribution (5) describes the state of our knowledge of the mixing proportions 
and concentrations in each cell type, given the measurements. From the perspective of statistical physics, the mixing 
proportions and concentrations in each cell type define a phase space, and the posterior distribution (5) defines a 
Hamiltonian (6) describing how strongly the probability measure of mixing proportions/concentrations is focused in 
particular parts of this phase space. The partition function 
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with Trp^x = / dp / dx (5 (^^'^j^ — 1) sets out the statistical mechanics of sample deconvolution at /3 = 1. The 
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is a measure of the uncertainty of reconstruction. 



III. PARTITION FUNCTION AND THEORETICAL RECONSTRUCTION ACCURACY 

We can now address the theoretical limit of sample deconvolution. For a given set of measurements X, the statistics 
of mixtures and concentrations is described by the partition function (7). Suppose now that these measurements 
are generated using (3) from underlying mixing proportions p^ and concentrations X"^. These are the targets the 
reconstruction aims for. In order to explore the behaviour of the system for typical realisations of these targets, 
the quenched average of InZx over p-^ and x-^ needs to be computed [1">]. We restrict ourselves to the so-called 
annealed approximation and calculate the average of Zx over the target mixtures and concentrations. We will show 
numerically that the difference between quenched and annealed results are small for a reasonable choice of parameters. 
The annealed average over Zx is given by 



((^x)) 
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(8) 



The average ((.)) is over all data matrices weighted by their probabilities under the generative model P(p-^, x-'", ^) = 
exp{-^^^-^(x--x)^ 

partition function Zx leads to a large set of Gaussian integrals, which in the thermodynamic limit N ^ oo can be 
evaluated using the saddle-point approximation [l-^-IS]. For the thermodynamic limit we consider the scaling ansatz 
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4 



M = aN and (t| = a'^N. In a concrete situation N is of course finite, and a, cr^ will be small (but also finite). In a 
lengthy but standard calculation we obtain the saddle point equations 



€b^ -^{plPb)p,pT , (llb = {xlxb)^^^T , (9) 
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The g-variables in eq. (9) are the order parameters of the system. They are connected to the Euclidean distance 
between the reconstruction and its target, = :^J2ai{^i~^I°'Y ^'^'l similarly rp — :;^J2an{Pa~Pa'^Y^ 
through the relationships ^ ^J2a {laa - 2gJ„ + qJJ) and rp ^ ^J2a (^^a - 2(?J„ + gJJ). These saddle-point 
equations have to be solved numerically. We note the formal similarity of the number of cell types with replicas used 
to calculate quenched averages. A quenched calculation of this system would thus result in a system of equations 
bearing a two-replica structure [19]. 

Solving the saddle-point equations (9) numerically for the first non-trivial case of n = 2 cell types, we are particularly 
interested in the difference between target and the reconstructed mixtures and concentrations and how this difference 
depends on the number of samples M. To this end, we define a normalized order parameter r" = which is zero 
for perfect reconstruction and one for random guessing from the prior distribution of the target solution. Figure 2 
shows how the reconstruction improves with increasing number of samples M. This is not surprising as each additional 
sample brings different mixing proportions and induces correlations in the measurements X^^ across genes, from which 
both concentrations in the cell types and mixing proportions can be reconstructed. There is no finite threshold in the 
number of samples below which reconstruction is impossible, at any non-zero value of a = M/N the reconstruction 
is better than a random choice of concentrations and mixtures. 

To compare these results with numerical simulations, we evaluated the quenched average by drawing target mixtures 
and concentrations from the Gaussian prior distribution and the measurements generated from (3). Then Markov 
chain Monte Carlo (MCMC) sampling is used to draw the reconstructed mixtures and concentrations from the 
Boltzmann posterior distribution (6). We also simulated the annealed average by including prior for the target 
mixtures/concentrations into the Hamiltonian and sampling over the target mixtures/concentrations as well. Very 
good agreement between these numerical and the analytical results is seen in Fig. 2. 



5 




10 



A numerical quenched 
O numerical annealed 
^— analytical 



6 



10 



-3 



1 



1000 



10' 



,6 



a 



FIG. 2: Numerical simulations corresponding to annealed and quenched average are in good agreement with the analytical 
annealed result. With the increase of the fraction of samples a the reconstruction accuracy improves from random guessing 
(r" = 1) to a perfect reconstruction (r" = 0). Annealed and quenched simulation results are in good agreement, indicating 
the annealed calculation to be a good approximation to the quenched calculation in this case. The parameters are chosen as: 
M = 1 — 500 at A'^ = 500, n — 2, p = 1/3 and ap = 0.05 for all cell types equally, x — 3 and CTj; = 1 also for all cell types 
equally, = 0.1. 



The Boltzmann posterior distribution (5) with the Hamiltonian (6) suggests itself as the basis for a simple sample 
deconvolution algorithm based on a particular set of measurements X^. Using MCMC sampling of this poste- 
rior [20], the entire space of reconstructions can be explored weighted with the posterior probability. An arbitrary 
starting configuration Cq = {p,x} is chosen and from this starting point, a new neighbouring configuration Ci is 
generated by randomly increasing or decreasing one of the free parameters by a small amount within the positivity 
constraints and the constraint on the mixing proportions. The new configuration is accepted with the probability 
Pace = min (l, e^^^i"'^"') with energies Hq/Hi corresponding to configuration Cq and Ci, respectively (Metropolis 
rule). Since only the ratio of the posterior probability of two configurations enters, the marginal likelihood P (^) 
does not enter the algorithm. 

We test this algorithm on artificially created datasets. For this purpose a target solution x-^ and p"^ is generated, 
drawn randomly from Gaussian distributions with means /p^ and variances crj^/crj^. Then the measurements 
are generated according to eq. (1) by adding Gaussian noise of variance aj"^. The Metropolis rule is used to sample 
the posterior (5) of the mixtures p^, the concentrations xf, and the parameters describing the prior Xa/pa, <^x a/^p a 
and i7|s. In this way only the general shape of the prior distribution needs to be chosen, the actual parameters of the 
distribution are estimated from the data. 

This MCMC sampling allows to sample the regions in phase space where the posterior probability is high. An 
estimate of the mixtures and concentrations is provided by the mean values of and obtained by averaging 
over a large number of configurations visited during the sampling process. In addition to this posterior mean, we 
also compute the standard deviations of xf and under the posterior. These deviations quantify the remaining 
uncertainty in the reconstruction, given the limited amount of data and the noise on the data. They can serve as 
error estimates of the reconstruction when the target solution is unknown. In Fig. 3 we show for each variable xf the 
mean and standard deviation (as error bars) under the posterior (5) against the targets 

The results of this Bayesian sample deconvolution can be compared with those of a different class of sample 
deconvolution algorithms based non-negative matrix factorization (NMF) to invert the relationship X = p • x [12- 
14]. Starting with an initial guess of x, the matrix p is calculated that minimizes the ^2-iiorm ||X — p • x||f = 



^ (A"!" — X^aP'J'^i') (Frobenius norm). The minimization proceeds under the additional constraint of non-negative 



matrix entries. From this estimate of p an improved estimate for x is obtained by applying the procedure in turn to 
X. Iterating these steps the distance ||X — px||f of the reconstructed solution from the data matrix never increases. 
This corresponds to a minimization of "Hl in eq. (4). Convergence of the iterative scheme is thus guaranteed and 
reaches a local minimum of the Hamiltonian (4). 



IV. A SAMPLE DECONVOLUTION ALGORITHM 
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FIG. 3: Reconstruction by Bayesian algorithm (left plots) and the deconf algorithm (right plots). The reconstruction estimates 
are obtained from the same target solution x'^ for all plots. The reconstruction with the Bayesian algorithm is clearly more 
accurate: the reconstruction accuracy = 

^ Ea ^ i^f" - ^I") of the Bayesian algorithm using 5 samples is comparable 
to the accuracy of the deconf algorithm using 30 samples. Additionally, the standard deviation of the posterior distribution 
serves as a natural measure for uncertainty of the estimate, giving rise to the individual errorbars in the case of the Bayesian 
algorithm. Without knowing x'^ they still provide a good estimate for the accuracy of the reconstruction. Parameters used for 
both algorithms are A'' = 500, n — 3, M = 5, 10 and 30, p"^ = 1/3 and — 0.05 for all cell types equally, xF = Z and — 1 
also for all cell types equally, aj — Q.l. In order to distinguish single points with their corresponding error bars the values of 
only one in twenty of the 500 genes are plotted. 



As an example from the class of NMF algorithms we applied the deconf algorithm [12] to the same artificial data as 
used for the Bayesian algorithm. Fig. 3 shows how the deconf is outperformed by the Bayesian approach. 30 samples 
are needed by deconf to reach a similar reconstruction accuracy as the Bayesian algorithm using only 5 samples. 
Another NMF algorithm we tried [14] achieved an even lower accuracy. 

Of course prior information in (6) on the distribution of targets facilitates the reconstruction for the Bayesian 
algorithm. However, the prior is not responsible for the entire difference in performance between the Bayesian and 
the NMF approach. Fig. 4 shows that the effect of the prior is highest when the number of samples is small. This is 
to be expected, since the relative contribution of the prior to (6) increases when the amount of information coming 
from the measurements decreases. Even without using prior information, the Bayesian algorithm performs as well 
as the NMF algorithm in the low sample number regime. For an increasing number of samples the performance of 
sampling without prior approaches the performance with use of prior information. This is expected as well, since 
then the relative contribution of the prior to (6) becomes asymptotically negligible. NMF-approaches, formulated 
as optimization problems, give a point estimate in phase space that reproduces the matrix of measurements X as 
closely as possible leading to the well-known problem of overfitting [21, 22]. This is contrasted by the posterior mean 
estimate of the Bayesian approach, covering the entire phase space weighted by the posterior and leading to a more 
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FIG. 4: The prior information is most important when few samples are available. Then the Bayesian algorithm with prior 
(orange) clearly outperforms the NMF (gray) but also the Bayesian algorithm without prior (green). Without the use of prior 
information the Bayes algorithm is performing equally to the NMF algorithm for a small number of samples. With increasing 
sample number, the Bayes algorithm without prior learns much faster from the additional information, gradually approaching 
the performance of the Bayes algorithm with prior. Simulation parameters as in fig. .3. 



robust estimate. 

For the calculation of the partition function (8) we assumed the concrete case of Gaussian distributed residuals. For 
the algorithm, there is no loss of generality involved, any well-behaved probability density can be used in (5), leading 
generally to a Hamiltonian that is not quadratic. For the analytic calculation, Taylor expanding such a Hamiltonian 
around X would give T-L (p,x;X) = og + ai(X — px) + a2(X — px)^ + • • • . Our analytical calculation focusses 
exclusively on the second-order term. The first order term alone (apart from not being normalizable) , would induce 
no correlations between the targets p"^,x-^ and the estimated solution p,x. It may be possible, at least in principle, 
to evaluate the integrals arising from even polynomials beyond the second order, but the resulting expressions will 
not admit simple order parameters. The Gaussian distributed residuals (3) thus define the non-trivial yet tractable 
model of sample deconvolution. 

V. OUTLOOK 

In summary, we have developed a Bayesian model for reconstructing cell-type specific gene concentrations from 
samples containing an unknown mixture of cell types with unknown concentrations. Formulating the problem in 
the language of statistical mechanics, we have obtained an analytical solution for the reconstruction accuracy using 
the annealed approximation for the specific case of n = 2 cell types. This solution can be extended easily for any 
finite number of cell types. Our model also suggest a simple algorithm based on the MCMC sampling of the solution 
space weighted by the posterior distribution of the Bayesian model. This turns out to outperform methods based on 
minimizing the distance between the matrix of measurements X and the matrix product of mixing proportions p and 
concentrations x. 

As ever, the proper choice of the prior may be a delicate step, and Gaussian priors need not optimally describe 
real datasets. A study based on experimental datasets would be needed to settle this issue. But if a better choice 
for the prior is found, it would be straightforward to implement into the algorithm, leaving the rest of the Bayesian 
framework unchanged. 
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